Note: Grading is based both on your graphs and verbal explanations. Follow all best practices as discussed in class, including choosing appropriate parameters for all graphs. Do not expect the assignment questions to spell out precisely how the graphs should be drawn. Sometimes guidance will be provided, but the absense of guidance does not mean that all choices are ok.

Read Graphical Data Analysis with R, Ch. 4, 5

1. House features

[5 points]

Data: ames in the openintro package

  1. Create a frequency bar chart for the roof styles of the properties.
library(openintro)
library(tidyverse)
library(scales)
library(dplyr)
library(plotly)
data(ames)
c<-ggplot(ames, aes(x=reorder(Roof.Style, Roof.Style, function(x)-length(x)))) + geom_bar(fill = "lightblue")+ theme_classic()+  
  labs(x='Roof Styles', y ='Count') +
  ggtitle("Count of each roof style") +
  theme(plot.title = element_text(hjust = 0.5))
c

  1. Create a frequency bar chart for the variable representing the month in which the property was sold.
data(ames)
ames$Mo.Sold <- as.Date(paste0("2018-", ames$Mo.Sold, "-1"))
c<-ggplot(ames, aes(x=Mo.Sold)) + geom_bar(fill = "lightblue")+ theme_classic()+  
  labs(x='Months', y ='Count') +
  ggtitle("Count of each Month") +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_date(labels = date_format("%b"))
c

  1. List all the factor variables that have "Ex" "Fa" "Gd" "Po" "TA" as levels.
data(ames)
ls <- c()
vals<-c("Ex","Fa","Gd","Po","TA")
for(i in 1:ncol(ames)){
  if (all(vals %in% sapply(ames[,i],levels))){
    ls<-c(ls,i)
  }
}
colnames(ames[,ls])
## [1] "Exter.Cond"   "Bsmt.Qual"    "Bsmt.Cond"    "Heating.QC"   "Kitchen.Qual"
## [6] "Fireplace.Qu" "Garage.Qual"  "Garage.Cond"
  1. Create faceted bar charts using facet_wrap() to display the frequency distribution of all variables from part c). (Hint: transform the data first with pivot_longer())
library(data.table)
data(ames)
df<-ames
colnames=c('PID',"Exter.Cond","Bsmt.Qual","Bsmt.Cond","Heating.QC","Kitchen.Qual","Fireplace.Qu","Garage.Qual","Garage.Cond")
df<-df[colnames]
df%>%pivot_longer(cols=!PID,names_to='Type',values_to = "num")%>%na.omit()->df
df<-df[df$num != "", ]
ggplot(df, aes(x=reorder(num, num, function(x)-length(x))))+
  geom_bar(color='black')+
  facet_wrap(~Type, nrow=2)+
  ggtitle("Frequency distribution of select variables") +
  theme(plot.title = element_text(hjust = 0.5))+
  labs(x='Levels', y ='Count')

2. Pet names

[12 points]

Data: seattlepets in the openintro package

  1. Create separate Cleveland dot plots for the 30 most popular dog names and 30 most popular cat names.
data("seattlepets")
df<-seattlepets
df<-df[!is.na(df$animal_name), ]
df %>% filter(species == 'Dog')%>%group_by(animal_name) %>%tally(sort=TRUE)%>%slice(1:30) -> df_dog
df %>%filter(species == 'Cat')%>%group_by(animal_name) %>%tally(sort=TRUE)%>%slice(1:30) -> df_cat

ggplot(df_dog, aes(x = n, y=reorder(animal_name,n))) +
  geom_point()+ggtitle("Cleveland plot for Dogs") +
  theme(plot.title = element_text(hjust = 0.5))+
  labs(x='Count', y ='Names')+theme_linedraw()

ggplot(df_cat, aes(x = n, y=reorder(animal_name,n))) +
  geom_point()+ggtitle("Cleveland plot for Cats") +
  labs(x='Count', y ='Names')+theme_linedraw()

  1. Use a Cleveland dot plot to display the 30 names that are the most “dog” measured by the proportion of all animals with that name that are dogs. (You can remove goat and pig names from the dataset.) Clearly state any decisions you make about what to include and not include and explain your reasoning.
df2<-df[!(df$species=="Pig" | df$species=="Goat"),]
df2<-df2[!is.na(df2$animal_name), ]
df2<-df2[,c('species','animal_name')]
df2%>%group_by(animal_name,species)%>%tally()%>%filter(n()>1)%>%mutate(freq = n / sum(n))%>%arrange(desc(freq))%>%filter(species == 'Dog')->df3

df3=df3[0:30,]

ggplot(df3, aes(x = freq, y = reorder(animal_name, freq))) +
  geom_point()+ggtitle("Cleveland plot for most DOG names") +
  theme(plot.title = element_text(hjust = 0.5))+
  labs(x='Count', y ='Names')+theme_linedraw()

  1. Find the 30 most popular names for dogs and cats combined, and create a multidot Cleveland dot plot showing the counts for dogs, cats, and total for each of these 30 names. (One color for dogs, one color for cats, one color for total.) Order the dots by the total count.
df_total=df[!is.na(df$animal_name), ]
df_total%>%group_by(animal_name)%>%tally()%>%arrange(desc(n))%>%slice(1:30)->df_total
df %>% filter(species == 'Dog')%>%group_by(animal_name) %>%tally(sort=TRUE)->df_dog
df %>% filter(species == 'Cat')%>%group_by(animal_name) %>%tally(sort=TRUE)->df_cat
merge(
  x=df_dog,
  y=df_total,
  by.x='animal_name',
  by.y='animal_name'
)->df_total
merge(
  x=df_cat,
  y=df_total,
  by.x='animal_name',
  by.y='animal_name'
)->df_total

ggplot(df_total)+geom_point(aes(x=n.y,y=reorder(animal_name,n.y)),color='red')+geom_point(aes(x=n.x,y=animal_name),color='blue')+geom_point(aes(x=n,y=animal_name),color='green')+ggtitle("Cleveland plot for Popular names") +
  labs(x='Count', y ='Names')+theme_linedraw()

  1. Create a scatterplot of popular cat names vs. popular dog names. Clearly some names are more “dog” names and some are more “cat” names. Decide on a metric for defining what is a “dog” name, a “cat” name, and a “neutral” name and state it explicity. What is your metric?
merge(
  x=df_cat,
  y=df_dog,
  by.x='animal_name',
  by.y='animal_name'
)->df_pop

ggplot(df_pop)+geom_point(aes(x=n.y,y=n.x),alpha=0.5,colour="black")+geom_abline(intercept=0, slope=1)+ggtitle("Scatterplot for Cat names vs Dog names") +
  labs(x='Number of Dogs with the name', y ='Number of Cats with the name')+theme_linedraw()

Approach

  • In order to measure and classify a name as belonging to a ‘dog’,‘cat’ or ‘neutral’, the frequency and hence the ratio of each animal category that shares the name is examined.
  • We keep a threshold ratio of 0.1 on either side of equality to decide whether the name belongs to a cat or a dog.
  • If points lie within the threshold space, then it can be classified as neutral.
  • The reasoning behind this is the paucity of names which share exactly the same number of cats and dogs.
  • Therefore the ratio - #Number of cats with a name:#Number of dogs with the same name is, Ratio >= 1.1 —> Popular cat name 0.9 < Ratio < 1.1 —> Neutral name. Ratio < 0.9 —> Popular dog name
  • It follows the logic that any point lying significantly above the y=x line is a ‘cat’ name and anything below a ‘dog’ name. If the point lies very close to the line, it is ‘neutral’.
  1. Create a new variable for type of name (“dog”, “cat” or “neutral”) and redraw the scatterplot coloring the points by this variable. Label individual points as you see fit (don’t label all of them.)
df_pop$frac<-df_pop$n.x/df_pop$n.y
ggplot(df_pop)+geom_point(aes(x=n.y,y=n.x, text=animal_name),col = ifelse(df_pop$frac < 0.90, 'red',ifelse(df_pop$frac>1.10,'green','blue')))+geom_abline(intercept=0, slope=1)+ggtitle("Scatterplot for Cat names vs Dog names") +
  theme(plot.title = element_text(hjust = 0.5))+
  labs(x='Number of Dogs with the name', y ='Number of Cats with the name')->x
ggplotly(x,tooltip='text') %>% config(displayModeBar = F)

NOTE

The convention for ‘neutral’, ‘cat’, ‘dog’ points are colored blue,green and red respectively. We used an interactive plot so each point can be examined better. This increases visibility in dense areas as well. On hovering cursor over the point you can see the animal name and the color of its class.

  1. What are your most interesting discoveries from this dataset?

Interesting discoveries

  • Firstly, there are a lot of names that belong to more number of dogs than cats in comparison with names that belong to cats than dogs following the above definition of dog name, cat name and neutral. Hence the dataset is imbalanced.
  • Moreover, looking at the last graph reveals that there are a lot more options for dog names as well. There are also some names that are more than 95% dog dominated.
  • The scatterplot for cat and dog names also reveals the dense nature of the plot for lower frequencies which is indicated at the bottom left of the scatterplot. This implies that there exists names which are either dog or cat that have been less frequently used.
  • The count of names that have been more frequently used for an animal is less. This is clearly indicated by the lighter portions of the plot at the top right and left center of the graph.

3. House sizes and prices

[6 points]

Data: ames in the openintro package

For all, adjust parameters to the levels that provide the best views of the data.

Draw four plots of price vs. area with the following variations:

base <- ames %>% ggplot(aes(x=area, y=price/1000))
  1. Scatterplot – adjust point size and alpha.
scatter <- base +
  geom_point(size=1.5, alpha=0.4, colour="darkblue") +
  xlab('Living Area in square feet') +
  ylab('Price (in thousand dollars)') +
  theme_bw()
scatter

  1. Scatterplot with density contour lines
scatter +  
  geom_density2d(size=0.6, colour="lightblue", bins=8)+
  xlab('Living Area in square feet') +
  ylab('Price (in thousand dollars)') +
  ggtitle('Scatterplot of price vs area with a focus on density contour lines') +
  theme_bw()

  1. Hexagonal heatmap of bin counts
base + 
  geom_hex(bins=15) +
  scale_fill_gradient(low = "skyblue", high = "darkblue") +
  xlab('Living Area in square feet') +
  ylab('Price (in thousand dollars)')+
  ggtitle('Hexagonal heatmap of price vs area') +
  theme_bw()

  1. Square heatmap of bin counts
  base+geom_bin2d(bins=15) +
  scale_fill_gradient(low = "skyblue", high = "darkblue") +
  xlab('Living Area in square feet') +
  ylab('Price (in thousand dollars)')+
  ggtitle('Square heatmap of price vs area') +
  theme_bw()

  1. Describe noteworthy features of the data, using the “Movie ratings” example on page 82 (last page of Section 5.3) as a guide.

Features

  • The housing pricing in general has an upward trend and hence is positively correlated with the living area of the house. Majority of the houses follow this trend as bolstered by the dark spot in between the plot.
  • However there are also two sets of outliers in the set
    • The first set of three houses with low housing price with respect to very large areas relative to other houses which have greater price for areas less than the outliers.
    • Second set are two points on the top right for two houses with very large price values for lesser area in comparison with other houses that have similar areas.
  • Most of the density is concentrated with houses priced between $50000-$300000 and 500-2500 sq ft area
  • There aren’t any houses which are priced extremely high but with low living area. Hence the diagonal half of the graph on the leftmost corner is empty. All of them seem to be below a boundary line, i.e. price < slope*area + y-intercept.

4. Correlations

[7 points]

Data: ames in the openintro package

  1. Recreate the scatterplot from part 3 (price vs. area) this time faceting on Neighborhood (use facet_wrap(). Add best fitting lines and sort the facets by the slope of the best fitting line from low to high. (Use lm() to get the slopes.)
ames_lm <- ames %>% 
  group_by(Neighborhood) %>% 
  group_modify(~ broom::tidy(lm(price ~ area, data=.x))) %>%
  filter(term == "area")
ames$slope <- ames_lm$estimate[match(ames$Neighborhood,ames_lm$Neighborhood)]
ames %>% ggplot(aes(x=area/1000, y=price/1000)) +
  geom_point(size=2, alpha=0.4) +
  xlab('Living Area per 1000 square feet') +
  ylab('Price (in thousand dollars)') +
  facet_wrap(.~reorder(Neighborhood, slope), nrow = 4) +
  geom_smooth(method = 'lm', formula = y~x)+
  ggtitle('Price vs Area Scatterplot facetted by Neighbourhood and ordered by slope') 

  1. Is the slope higher in neighborhoods with higher mean housing prices? Present graphical evidence and interpret in the context of this data.
ames_mean <- ames %>%
  group_by(Neighborhood) %>%
  summarise_at(vars(price), list(mean_price = mean))
ames_mean$slope <- ames_lm$estimate[match(ames_mean$Neighborhood,ames_lm$Neighborhood)]
ames_mean %>%
  ggplot(aes(x=slope, y=mean_price)) +
  geom_point() +
  ggtitle('Mean price vs Slope Scatterplot with a data point representing a Neighbourhood') +
  theme_bw()

  1. Repeat parts a) with the following adjustment: order the faceted plots by \(R^2\) from the linear regression of price on area by Neighborhood. Is the \(R^2\) higher in neighborhoods with higher mean housing prices? Are the results the same for slope and \(R^2\)? Explain using examples from the graphs.
r_squared_calc <- function(dat_in, area='area', price='price'){
  return(summary(lm(price ~ area, data=dat_in))$r.squared)
} 
ames <- ames %>% 
  group_by(Neighborhood) %>%
  do(data.frame(area = .$area,
                price = .$price,
                r_sqrd = r_squared_calc(.)))
ames %>% ggplot(aes(x=area/1000, y=price/1000)) +
  geom_point(size=2, alpha=0.4) +
  xlab('Living Area per 1000 square feet') +
  ylab('Price (in thousand dollars)') +
  facet_wrap(.~reorder(Neighborhood, r_sqrd), nrow = 4) +
  geom_smooth(method = 'lm', formula = y~x)

ames_mean$r_sqrd <- ames$r_sqrd[match(ames_mean$Neighborhood,ames$Neighborhood)]
ames_mean %>%
  ggplot(aes(x=r_sqrd, y=mean_price)) +
  geom_point() +
  ggtitle('Price vs Area Scatterplot facetted by Neighbourhood and ordered by R^2') +
  theme_bw()

Inferences

  • The results of \(R^2\) vs slope is different from mean prices vs slope. Apparently there seems to be no correlation between the \(R^2\) metric of the best fit line and the mean prices of a neighborhood. Hence we can’t say much about the behavior of \(R^2\) with higher mean prices.
  • The points are just spread across in a space with no clear trend making it difficult to give a trend analysis.
  • There is just one single data point/neighborhood that has an \(R^2=1\) and this can be corresponding to th neighborhood that has just two houses. Since a line can fit perfectly well for two data points, the \(R^2\) metric gives one.
  • For \(R^2\) values close to ~0.6, there seems to be several instances of both high mean prices and relatively low mean prices and hence rendering no proper correlation.
  • Only two data points that have \(R^2\) close to zero have relatively low mean prices. Although some examples of even lower mean prices can be seen with \(R^2\) close to 0.5.